An Efficient Reliable Broadcast Protocol
Many distributed and parallel applications can make good use of broadcast communication. In this paper we present a (software) protocol that simulates reliable broadcast, even on an unreliable network. Using this protocol, application programs need not worry about lost messages. Recovery from communication failures is handled automatically and transparently by the protocol. In normal operation, our protocol is more efficient than previously published reliable broadcast protocols. An initial implementation of the protocol on 10 MC68020 CPUs connected by a 10 Mbit/sec Ethernet performs a reliable broadcast in 1.5 msec.
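The abstract describes the protocol only at a high level; the following is a minimal sketch, assuming a simple scheme in which the sender numbers its broadcasts and receivers detect gaps and request retransmission. The class and method names are hypothetical and do not reproduce the authors' protocol.

    import java.util.*;

    /** Hypothetical sketch: a receiver tracks broadcast sequence numbers,
     *  records retransmission requests for any gaps it detects, and
     *  delivers messages to the application strictly in order.          */
    final class BroadcastReceiver {
        private long nextExpected = 0;                                    // next deliverable sequence number
        private final SortedMap<Long, String> pending = new TreeMap<>();  // buffered out-of-order messages

        /** Called for every broadcast that actually arrives; missing sequence
         *  numbers are appended to retransmitRequests for the sender.        */
        List<String> onMessage(long seq, String payload, List<Long> retransmitRequests) {
            List<String> delivered = new ArrayList<>();
            if (seq < nextExpected) return delivered;                 // duplicate, already delivered
            pending.put(seq, payload);
            for (long s = nextExpected; s < seq; s++) {               // any gap below seq is missing
                if (!pending.containsKey(s)) retransmitRequests.add(s);
            }
            while (pending.containsKey(nextExpected)) {               // deliver the in-order prefix
                delivered.add(pending.remove(nextExpected));
                nextExpected++;
            }
            return delivered;
        }
    }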
SiL: An Approach for Adjusting Applications to Heterogeneous Systems Under Perturbations
Scientific applications consist of large and computationally intensive loops. Dynamic loop scheduling (DLS) techniques are used to load balance the execution of such applications. Load imbalance can be caused by variations in loop iteration execution times due to problem, algorithmic, or systemic characteristics (collectively referred to as perturbations). The following question motivates this work: "Given an application, a high-performance computing (HPC) system, and both their characteristics and interplay, which DLS technique will achieve improved performance under unpredictable perturbations?" Existing work only considers perturbations caused by variations in the computational speed delivered by the HPC system. However, perturbations in available network bandwidth or latency are inevitable on production HPC systems. Simulator in the loop (SiL) is introduced herein as a new control-theory-inspired approach to dynamically select DLS techniques that improve the performance of applications on heterogeneous HPC systems under perturbations. The present work examines the performance of six applications on a heterogeneous system under all of the above system perturbations. The SiL proof of concept is evaluated using simulation. The performance results confirm the initial hypothesis that no single DLS technique can deliver the best performance in all scenarios, while the SiL-based DLS selection delivered improved application performance in most experiments.
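As a rough illustration of the simulator-in-the-loop idea, a scheduler can periodically feed the measured system state to a simulator, predict the remaining execution time under each candidate DLS technique, and switch to the predicted best one. The sketch below is hypothetical; the enum values and method names are illustrative and not the SiL implementation.

    import java.util.function.ToDoubleFunction;

    /** Hypothetical selection step of a simulator-in-the-loop scheduler:
     *  at a decision point, pick the DLS technique whose simulated
     *  remaining execution time is lowest.                              */
    final class SilSelector {
        enum Dls { STATIC, SELF_SCHEDULING, GUIDED, FACTORING, WEIGHTED_FACTORING }

        /** predictRemainingTime stands in for a simulator invocation that is
         *  fed the currently measured core speeds, bandwidth, and latency.  */
        static Dls select(ToDoubleFunction<Dls> predictRemainingTime) {
            Dls best = Dls.STATIC;
            double bestTime = Double.POSITIVE_INFINITY;
            for (Dls candidate : Dls.values()) {
                double t = predictRemainingTime.applyAsDouble(candidate);
                if (t < bestTime) { bestTime = t; best = candidate; }
            }
            return best;
        }
    }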
High-performance parallel programming in Java: exploiting native libraries
With most of today's fast scientific software written in Fortran and C, Java has a lot of catching up to do. In this paper we discuss how new Java programs can capitalize on high-performance libraries for other languages. With the help of a tool, we have automatically created Java bindings for several standard libraries: MPI, BLAS, BLACS, PBLAS, and ScaLAPACK. The purpose of the additional software layer introduced by the bindings is to resolve the interface problems between different programming languages, such as data type mapping, pointers, and multidimensional arrays. For evaluation, performance results are presented for Java versions of two benchmarks from the NPB and PARKBENCH suites on the IBM SP2 using the JDK and IBM's high-performance Java compiler, and on the Fujitsu AP3000 using Toba, a Java-to-C translator. The results confirm that fast parallel computing in Java is indeed possible.
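The abstract does not show what the generated binding layer looks like. A minimal hand-written sketch, assuming the common JNI route and a BLAS dot-product routine, might resemble the following; the class name and library name are hypothetical, and the small C forwarding stub is not shown.

    /** Hypothetical analogue of a generated BLAS binding: the Java side
     *  declares a native method, and a JNI stub forwards it to ddot.    */
    public final class NativeBlas {
        static {
            System.loadLibrary("blasjni");                    // hypothetical JNI wrapper library
        }

        /** Dot product of two double vectors, delegated to native BLAS ddot. */
        public static native double ddot(int n, double[] x, int incx,
                                                double[] y, int incy);

        public static void main(String[] args) {
            double[] x = {1.0, 2.0, 3.0};
            double[] y = {4.0, 5.0, 6.0};
            System.out.println(ddot(x.length, x, 1, y, 1));   // expected: 32.0
        }
    }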
Massively parallel computing in Java
Although Java was not specifically designed for the computationally intensive numeric applications that are the typical fodder of highly parallel machines, its widespread popularity and portability make it an interesting candidate vehicle for massively parallel programming. With the advent of high-performance optimizing Java compilers, the open question is: how can Java programs best exploit massive parallelism? The authors have been contemplating this question via libraries of Java routines for specifying and coordinating parallel codes. It would be most desirable to have these routines written in 100%-Pure Java; however, a more expedient solution is to provide Java wrappers (stubs) to existing parallel coordination libraries such as MPI. MPI is an attractive alternative because, like Java, it is portable. We discuss both approaches here. In undertaking this study, we have also identified some minor modifications to the current language specification that would make 100%-Pure Java parallel programming more natural.
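To make the wrapper (stub) approach concrete, a hypothetical thin Java stub over an MPI-style coordination library could expose the usual SPMD entry points, as in the sketch below; the class and method names are illustrative and do not correspond to any particular binding such as mpiJava.

    /** Hypothetical Java stubs over a native MPI-style library; each native
     *  method would be a thin JNI forwarder to the corresponding C MPI call. */
    final class Mpi {
        static { System.loadLibrary("mpistub"); }            // hypothetical wrapper library
        static native void init(String[] args);
        static native int rank();                            // this process's rank
        static native int size();                            // total number of processes
        static native void send(double[] buf, int dest, int tag);
        static native void recv(double[] buf, int source, int tag);
        static native void finish();                         // wraps MPI_Finalize
    }

    /** Sketch of SPMD usage: rank 0 sends a vector to every other rank. */
    final class SpmdExample {
        public static void main(String[] args) {
            Mpi.init(args);
            double[] data = new double[4];
            if (Mpi.rank() == 0) {
                for (int dest = 1; dest < Mpi.size(); dest++) Mpi.send(data, dest, 0);
            } else {
                Mpi.recv(data, 0, 0);
            }
            Mpi.finish();
        }
    }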
Toward a Standard Interface for User-Defined Scheduling in OpenMP
Parallel loops are an important part of OpenMP programs. Efficient scheduling of parallel loops can improve the performance of these programs. The current OpenMP specification only offers three options for loop scheduling, which are insufficient in certain instances. Given the large number of other possible scheduling strategies, standardizing each of them is infeasible. A more viable approach is to extend the OpenMP standard to allow a user to define loop scheduling strategies within her application. This approach will enable standard-compliant, application-specific scheduling. This work analyzes the principal components required by user-defined scheduling and proposes two competing interfaces as candidates for the OpenMP standard. We conceptually compare the two proposed interfaces with respect to the three host languages of OpenMP, i.e., C, C++, and Fortran. These interfaces serve the OpenMP community as a basis for discussion and for prototype implementations supporting user-defined scheduling in an OpenMP library.
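For illustration only, the kind of components a user-defined schedule must supply (per-loop initialization, a thread-safe "next chunk" step, and cleanup) can be sketched as an interface. The sketch below is written in Java purely as a language-neutral illustration and is hypothetical; the paper's actual proposals target OpenMP's host languages C, C++, and Fortran.

    import java.util.concurrent.atomic.AtomicLong;

    /** Hypothetical illustration of the principal components of a user-defined
     *  loop schedule: initialization, chunk hand-out, and cleanup.             */
    interface UserDefinedSchedule {
        void init(long totalIterations, int numThreads);   // called once before the loop
        long[] nextChunk(int threadId);                     // returns {start, size}, or null when done
        void finish();                                      // called once after the loop
    }

    /** Example strategy: fixed-size chunks handed out on demand (self-scheduling). */
    final class FixedChunkSchedule implements UserDefinedSchedule {
        private final long chunkSize;
        private final AtomicLong next = new AtomicLong();
        private long total;

        FixedChunkSchedule(long chunkSize) { this.chunkSize = chunkSize; }
        public void init(long totalIterations, int numThreads) { total = totalIterations; next.set(0); }
        public long[] nextChunk(int threadId) {
            long start = next.getAndAdd(chunkSize);
            if (start >= total) return null;
            return new long[]{start, Math.min(chunkSize, total - start)};
        }
        public void finish() { }
    }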
OpenMP Loop Scheduling Revisited: Making a Case for More Schedules
In light of continued advances in loop scheduling, this work revisits OpenMP loop scheduling by outlining the current state of the art and presenting evidence that the existing OpenMP schedules are insufficient for all combinations of applications, systems, and their characteristics. A review of the state of the art shows that, due to the specifics of parallel applications, the variety of computing platforms, and the numerous performance degradation factors, no single loop scheduling technique can be a 'one-fits-all' solution that effectively optimizes the performance of all parallel applications in all situations. Irregularity in computational workloads and hardware systems, including operating system noise, degrades the performance of parallel applications; this impact has often been neglected in loop scheduling research, in particular in the context of OpenMP schedules. Existing dynamic loop self-scheduling techniques, such as trapezoid self-scheduling, factoring, and weighted factoring, offer unexplored potential to alleviate this degradation in OpenMP because they explicitly target the minimization of load imbalance and scheduling overhead. Through theoretical and experimental evaluation, this work shows that these loop self-scheduling methods provide a benefit in the context of OpenMP. In conclusion, OpenMP must include more schedules to offer a broader performance coverage of applications executing on an increasing variety of heterogeneous shared memory computing platforms.
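The dynamic self-scheduling techniques named above differ mainly in how they shrink chunk sizes as a loop progresses. The sketch below shows simplified, commonly cited textbook formulations (factoring with a fixed factor of 2, and trapezoid self-scheduling with a final chunk size of 1); it is an illustration under those assumptions, not the exact variants evaluated in this work.

    import java.util.ArrayList;
    import java.util.List;

    /** Simplified textbook chunk-size rules for two dynamic self-scheduling
     *  techniques. n = total loop iterations, p = number of threads.        */
    final class SelfScheduling {

        /** Factoring (simplified, factor 2): schedule batches of p equal chunks,
         *  each batch consuming roughly half of the remaining iterations.       */
        static List<Long> factoringChunks(long n, int p) {
            List<Long> chunks = new ArrayList<>();
            long remaining = n;
            while (remaining > 0) {
                long chunk = Math.max(1, (long) Math.ceil(remaining / (2.0 * p)));
                for (int i = 0; i < p && remaining > 0; i++) {
                    long c = Math.min(chunk, remaining);
                    chunks.add(c);
                    remaining -= c;
                }
            }
            return chunks;
        }

        /** Trapezoid self-scheduling: chunk sizes decrease linearly from
         *  f = ceil(n / (2p)) down to l = 1.                              */
        static List<Long> trapezoidChunks(long n, int p) {
            List<Long> chunks = new ArrayList<>();
            long f = (long) Math.ceil(n / (2.0 * p)), l = 1;
            long c = (long) Math.ceil(2.0 * n / (f + l));       // number of chunks
            double d = c > 1 ? (double) (f - l) / (c - 1) : 0;  // linear decrement
            long remaining = n;
            for (long i = 0; i < c && remaining > 0; i++) {
                long chunk = Math.min(remaining, Math.max(1, Math.round(f - i * d)));
                chunks.add(chunk);
                remaining -= chunk;
            }
            if (remaining > 0) chunks.add(remaining);           // rounding leftover
            return chunks;
        }
    }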